-
Notifications
You must be signed in to change notification settings - Fork 12
Vector Store, full separation codec / vectorstore #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…it cleanly passes through the coded layer all the time
| """ | ||
| return _default_encode_id(filter_id) | ||
|
|
||
| def encode_ids(self, filter_ids: list[str]) -> dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two potential issues with this being a separate method:
- For a user that has a query for "these IDs AND this predicate" they're going to need to write the
_idfilter themselves unless they have access to the codec. - We'd also need the rewriting that happens to special case the
_idfilter and not rewrite it tometadata._id.
I wonder if we should just adopt a $id or _id as the standar field for the id. Then, I don't know if we'd need the encode_ids -- we could just make the rewriter do the right thing in that case, and the user can provide { "$id": <id> } or { "$id": { "$in": [<ids>] }} depending on their needs in arbitrary places within the query.
| doc_id = self.document_codec.get_id(document) | ||
| return await _async_collection.replace_one( | ||
| {"_id": document["_id"]}, | ||
| self.document_codec.encode_id(doc_id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of allowing the codec to just be a filter rewriter actually would be pretty useful here as well. It would allow these to just be "do a query with codec.rewrite_filters({"$id": doc_id}) (don't remember the name it currently has) which feels like it's clearer what is being done. Then instead of having a bunch of methods on the codec that are used for creating queries, there is just the one that converts a "standard" query (on the Document basically) into a appropriately encoded query.
|
@bjchambers Thank you for taking the time to review this. You make good points. I have added the "do not merge" label while I go over it again. Here are the two points I take from your remarks.
More on the second point: "any query", in LC world -- where
What I'm trying to get to is, since the codec's "query translator" makes an abstract query into a payload good to go to Data API, its input could be required to be not a dictionary (which arguably makes usage more error-prone), rather a specific structure: instances of some I can probably find some time to rework this PR in this sense tomorrow - would that capture the essence of your remarks? (plus helping with clarity since it requires data classes to express the abstractness of queries? Or perhaps too unwieldy?) |
|
I like that direction and the observation it doesn't all need to be in a dict. That will keep the difference between I wonder if a data class is necessary. Could it just be: I think keeping vector separate (it goes to sort, not the filter) could be reasonable. |
|
Right, and the Also agree that at this point the data class becomes useless weight. You convinced me: no data class :) |
|
I have replaced the "encode_id[s]" for a Some notes:
I have adapted the unit tests ( from langchain_astradb.utils.vector_store_codecs import _DefaultVSDocumentCodec, _FlatVSDocumentCodec
d = _DefaultVSDocumentCodec(content_field='c', ignore_invalid_documents=False)
f = _FlatVSDocumentCodec(content_field='c', ignore_invalid_documents=False)
d.encode_query()
f.encode_query()
# both: {}
d.encode_query(ids=['id1'])
f.encode_query(ids=['id1'])
# both: {'_id': 'id1'}
d.encode_query(ids=['id1', 'id2'])
f.encode_query(ids=['id1', 'id2'])
# both: {'_id': {'$in': ['id1', 'id2']}}
d.encode_query(ids=['d'],filter_dict={'x':'y'})
f.encode_query(ids=['d'],filter_dict={'x':'y'})
# resp.:
# {'$and': [{'_id': 'd'}, {'metadata.x': 'y'}]}
# {'$and': [{'_id': 'd'}, {'x': 'y'}]}
d.encode_query(ids=['d'],filter_dict={'x':'y','z':'w'})
f.encode_query(ids=['d'],filter_dict={'x':'y','z':'w'})
# resp.:
# {'$and': [{'_id': 'd'}, {'metadata.x': 'y', 'metadata.z': 'w'}]}
# {'$and': [{'_id': 'd'}, {'x': 'y', 'z': 'w'}]}
d.encode_query(ids=['d'],filter_dict={'$or':[{'x':'y'},{'z':'w'}]})
f.encode_query(ids=['d'],filter_dict={'$or':[{'x':'y'},{'z':'w'}]})
# resp.:
# {'$and': [{'_id': 'd'}, {'$or': [{'metadata.x': 'y'}, {'metadata.z': 'w'}]}]}
# {'$and': [{'_id': 'd'}, {'$or': [{'x': 'y'}, {'z': 'w'}]}]} |
(an incidental note is that trying to use |
| *, | ||
| ids: Iterable[str] | None = None, | ||
| filter_dict: dict[str, Any] | None = None, | ||
| ) -> dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May merit pydoc indicating the implicit $and.
|
|
||
| if clauses: | ||
| if len(clauses) > 1: | ||
| return {"$and": clauses} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this makes sense. In general, my rationale is:
- If you want
ids OR filter, then run separate queries -- the IDs only query is likely to be fast and the filter-only query is likely to do a scan. - If you want
ids AND filterthen there is no option beyond running them together (unless you somehow emulate the full filtering semantics on the client side).
So, it seems like the AND is the only reasonable choice.
Clean separation of codecs/vectorstore layers
This PR introduces a better separation of knowledge on what pertains to codecs and what constitutes the underlying logic of the Vector Store. As such,
fixes #106
Meanwhile, it provides a couple of useful querying tools which feel like they belong to the vstore+codec layer. (though another such tool, the "run_query" method, is postponed to a follow-up PR. That one is possibly the most important of these.)
Additionally, this restructuring of the code also makes a step toward a possible extension to cover API Tables without duplicating logic.
More in detail:
A note on the name chosen for the
get_idcodec method. There is the unfortunate fact that LangChain calls "documents" its internal format, and Astra DB calls "document" what it stores. So ... codecs that lie between these two have sometimes a hard time with naming. Keeping it simple (get_id) should work in this case because only one of the two ends of the codec (the Astra side) can have a variable schema: then, only on that side should the need arise for a function abstracting the reading of an ID. (Which btw is more a formality as the"_id"field is one of those that will hardly ever change in Astra DB!)